Introduction

In this report, we consider a non-optimal subsampling approach for fitting a PIM to a large dataset. We first simulate a dataset (sample size \(N = 250{,}000\)) under the normal linear model, considering several models of increasing complexity. We then apply a non-optimal subsampling algorithm and record the estimated \(\beta\) parameters and the time needed (on the HPC infrastructure) to estimate them. In the first section, we briefly explain the algorithm. Next we give the simulation details. Finally, we look at the results.

Non-Optimal Subsampling

The main problem when fitting a Probabilistic Index Model (PIM) on a large dataset is the time needed to estimate the parameters. In order to limit the computational time, we first explore a non-optimal subsampling approach. Consider a set of subsampling probabilities \(\pi_i, i = 1,...,n\) assigned to all data points. We define \(\boldsymbol{\pi} = \{\pi_i\}_{i=1}^n\). Next we apply the following algorithm:

In this report, we consider the uniform subsampling probabilities \(\boldsymbol{\pi}^{\text{UNI}} = \{\pi_i = n^{-1}\}_{i=1}^n\) in step 1.
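The procedure can be sketched as follows. This is a minimal sketch, not the simulation script itself; the `pim()` interface from the pim package and sampling without replacement are assumptions here.

```r
# Minimal sketch of the subsampling algorithm (assumptions: pim::pim() as the
# fitting routine and sampling without replacement; not the exact HPC script).
library(pim)

subsamplePIM <- function(data, K, B, probs = rep(1 / nrow(data), nrow(data))) {
  betas <- numeric(B)
  for (b in seq_len(B)) {
    # Step 1: draw a subsample of size K with probabilities pi_i (uniform here)
    idx <- sample(nrow(data), size = K, prob = probs)
    # Step 2: fit the PIM on the subsample, using the probit link
    fit <- pim(Y ~ X, data = data[idx, ], link = "probit")
    betas[b] <- coef(fit)["X"]
  }
  # Step 3: average the B estimates to obtain the final estimate
  mean(betas)
}
```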

Simulations

Model 1

Data generation

We first generate data under the linear model:

\[ Y_i = \alpha X_i + \varepsilon_i \] with \(i = 1,...,n\), where the \(\varepsilon_i\) are i.i.d. \(N(0,\sigma^2)\) with \(\sigma^2 = 1\). We consider a sample size \(n = 250{,}000\). The predictor \(X\) is uniformly spaced on \([0.1,1]\). The value of \(\alpha\) is set to 5. Using the probit link function, the corresponding PIM can be expressed as:

\[ \Phi^{-1}[P(Y\preceq Y'|X,X')] = \beta(X' - X) \] where \(\beta = \alpha/\sqrt{2\sigma^2}\). Hence the true value for \(\beta\) equals 3.5355339.
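As a quick check, the true \(\beta\) can be computed directly from \(\alpha\) and \(\sigma^2\):

```r
# True beta under the probit PIM: beta = alpha / sqrt(2 * sigma^2)
alpha <- 5
sigma2 <- 1
alpha / sqrt(2 * sigma2)
# [1] 3.535534
```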

Model 2

For the second model, we choose \(\alpha = 10\), a predictor \(X\) uniformly spaced on \([0.1,10]\), and \(\sigma^2 = 25\). Hence, the true value for \(\beta =\) 1.4142136.

Model 3

The third model is a multiple regression model in which we take parameters from the analysis in Przybylski and Weinstein (2017). In their study, the authors used linear regression to model the effect of (among other variables) smartphone usage on weekdays and weekends on self-reported mental well-being. To control for confounding, they included variables for sex, whether the participant lived in an economically deprived area, and ethnic background (minority group yes/no). The full model is given as:

\[ Y_i = \mu + \alpha_1 X_i + \gamma_1 Z_{1i} + \gamma_2 Z_{2i} + \gamma_3 Z_{3i} + \varepsilon_i \]

Based on a linear regression, we obtain the regression parameters, the proportions of the covariates (\(Z_j\)), the range of the predictor (\(X\), smartphone usage during weekdays, measured on a Likert scale from 0 to 7), and the spread of the observations (\(\sigma = 9.51\)). These parameters are then used to generate i.i.d. normal data.

dataWB <- read.csv(file = paste(wd, '/OSF_Przybylski2017/data.csv', sep = ''))

# Smartphone screen usage during weekdays, corrected for sex, economic area and minority status
dataWB %>% select(mwbi, sp_wd, male, minority, deprived) %>% 
  filter(complete.cases(.)) %>%
  mutate(sp_wd_sq = sp_wd**2) %>% 
  lm(mwbi ~ sp_wd + male + minority + deprived, data = .) %>%
  summary(.)

Call:
lm(formula = mwbi ~ sp_wd + male + minority + deprived, data = .)

Residuals:
    Min      1Q  Median      3Q     Max 
-37.356  -5.675   0.461   6.146  26.756 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 46.71698    0.05967  782.86  < 2e-16 ***
sp_wd       -0.43171    0.01183  -36.48  < 2e-16 ***
male         4.55003    0.05514   82.52  < 2e-16 ***
minority     0.30486    0.06514    4.68 2.87e-06 ***
deprived    -0.45068    0.05578   -8.08 6.55e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 9.11 on 116625 degrees of freedom
Multiple R-squared:  0.08225,   Adjusted R-squared:  0.08221 
F-statistic:  2613 on 4 and 116625 DF,  p-value: < 2.2e-16

# Spread of data
sd(dataWB$mwbi, na.rm = TRUE)
[1] 9.51486

# proportion of gender, minority and deprived
mean(dataWB$male)
[1] 0.475819
mean(dataWB$deprived)
[1] 0.4348
mean(dataWB$minority)
[1] 0.2407609

Then for each simulation, we generate a dataset through:

# Sample size
n <- 250000
# Regression parameters
alpha_0 <- 46.72
alpha_1 <- -0.43
# Control variables: sex (0 = female, 1 = male)
sex <- c(0, 1)
alpha_sex <- 4.55
# Economic area (deprived = 1, otherwise 0)
econArea <- c(0, 1)
alphaEcon <- -0.45
# Ethnic background (white = 0, otherwise 1)
ethnic <- c(0, 1)
alpha_ethnic <- 0.30
# Predictor is a discrete scale from 0 --> 7
u <- 7
# Sigma based on dataset
sigma <- 9.51

# Seed: note, in simulations, this seed depends on ID of simulation!
set.seed(123)

# Predictors: proportions come from observed proportions in study of mental well being
X_smartph_hrs <- sample(x = c(0:u), size = n, replace = TRUE)
X_sex <- sample(x = sex, size = n, replace = TRUE, prob = c(1-0.4758, 0.4758))
X_econArea <- sample(x = econArea, size = n, replace = TRUE, prob = c(1-0.4348, 0.4348))
X_ethnic <- sample(x = ethnic, size = n, replace = TRUE, prob = c(1-0.2408, 0.2408))

 # Observed data
Y <- alpha_0 + alpha_1*X_smartph_hrs + alpha_sex*X_sex +
       alphaEcon*X_econArea + alpha_ethnic*X_ethnic +
        rnorm(n = n, mean = 0, sd = sigma)

The true value for \(\beta =\) -0.0319722.
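This value follows from the same formula as before, \(\beta = \alpha_1/\sqrt{2\sigma^2}\), applied to the Model 3 parameters:

```r
# True beta for Model 3: alpha_1 / sqrt(2 * sigma^2)
alpha_1 <- -0.43
sigma <- 9.51
alpha_1 / (sqrt(2) * sigma)
```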

For comparison, we plot the distribution of mental well-being in the dataset (left) and \(n = 250{,}000\) simulated data points (right).

par(mfrow = c(1,2))
hist(dataWB$mwbi, main = "Mental well being in Przybylski and Weinstein (2017)", xlab = "")
hist(Y, main = "Simulated mental well being", xlab = "")

Note that we did not reproduce the skewness observed in the real dataset.

Model 4

For the fourth model, we choose \(\alpha = 1\), a predictor (X) uniformly spaced between \([0.1,10]\) and \(\sigma^2 = 25\). Hence, the true value for \(\beta =\) 0.1414214.

Simulation parameters

We vary the number of iterations B as well as the number of selected samples K in each iteration to find the optimal balance between bias and computation time. Both B and K range from 10 to 1000, with 10 equally spaced levels for each parameter. Hence we have \(10 \times 10 = 100\) scenarios.

    B  K  ...    B    K
1  10 10  ...  450 1000
2 120 10  ...  560 1000
3 230 10  ...  670 1000
4 340 10  ...  780 1000
5 450 10  ...  890 1000
6 560 10  ... 1000 1000

We simulate each combination of B and K 1000 times.
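The scenario grid corresponds to 10 equally spaced levels per parameter; constructing `K_vec`, `nRSloops_vec`, and `pairs` this way is our assumption, but it matches the values shown above:

```r
# 10 equally spaced levels between 10 and 1000 for both K and B
K_vec <- seq(10, 1000, length.out = 10)        # 10, 120, 230, ..., 1000
nRSloops_vec <- seq(10, 1000, length.out = 10) # same levels for B
pairs <- expand.grid(K = K_vec, nRSloops = nRSloops_vec)
nrow(pairs)
# [1] 100
```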

Results

Model 1

Parameters:

u <- 1
alpha <- 5
sigma <- 1
trueBeta <- alpha/(sqrt(2) * sigma)
# Model (called SCEN in script)
SCEN <- 1

Load in data

Start with estimated \(\beta\):

# load in estimated beta values
for(i in 1:nsim){
  if(i %in% progress) print(paste0("At ", i/nsim*100, "%"))
  # Values in one simulation: estimated beta values averaged within the sampling algorithm
  ValSim <-  try(read.table(file = paste0(datLoc, 'uni_beta_vector_simID_', i, '_SCEN_', SCEN, '.txt'),
               col.names = c("beta", "sVariance", "K", "nRSloops", "TrueBeta"), 
               header = TRUE) %>% tbl_df() %>%
      group_by(K, nRSloops, TrueBeta) %>% summarise(avBeta = mean(beta)) %>% 
      mutate(sim = i), silent = TRUE)
  if(inherits(ValSim, "try-error")){
    print(i);next
  }
  
  # Add to data frame with all simulations
  valuesAllSim <- bind_rows(valuesAllSim, ValSim)
}

Same for computational time:

# load in computational time
for(i in 1:nsim){
  if(i %in% progress) print(paste0("At ", i/nsim*100, "%"))
  # Values in one simulation: computational time 
  TimeSim <-  try(read.table(file = paste0(datLoc, 'uni_time_vector_simID_', i, '_SCEN_', SCEN, '.txt'),
              col.names = c("minutes"), header = TRUE) %>% tbl_df() %>%
              bind_cols(., pairs) %>% 
              mutate(sim = i), silent = TRUE)
    if(inherits(TimeSim, "try-error")){
    print(i);next
  }
  
  # Add to data frame with all simulations
  timeAllSim <- bind_rows(timeAllSim, TimeSim)
}

Bias vs computational time

Quick summary:

tbl_df(valuesAllSim)
# A tibble: 100,000 × 4
       K nRSloops   avBeta   sim
   <int>    <int>    <dbl> <int>
1     10       10 4.100895     1
2     10      120 4.162509     1
3     10      230 4.109357     1
4     10      340 4.143610     1
5     10      450 4.200134     1
6     10      560 4.203660     1
7     10      670 4.236758     1
8     10      780 4.451395     1
9     10      890 4.413864     1
10    10     1000 4.392884     1
# ... with 99,990 more rows
group_by(valuesAllSim, K, nRSloops) %>% summarise(avBeta = mean(avBeta)) %>% summary()
       K           nRSloops        avBeta     
 Min.   :  10   Min.   :  10   Min.   :3.540  
 1st Qu.: 230   1st Qu.: 230   1st Qu.:3.541  
 Median : 505   Median : 505   Median :3.545  
 Mean   : 505   Mean   : 505   Mean   :3.632  
 3rd Qu.: 780   3rd Qu.: 780   3rd Qu.:3.556  
 Max.   :1000   Max.   :1000   Max.   :4.388  

Plotting the estimated \(\beta\) parameters for each simulation, faceted by the number of iterations (B).

MSEtable_tmp <- group_by(valuesAllSim, K, nRSloops) %>% mutate(SE = (avBeta - TrueBeta)^2) %>%
  summarise(MSE = mean(SE, na.rm = TRUE))
MSEtable <- matrix(MSEtable_tmp$MSE, ncol = length(nRSloops_vec), byrow = TRUE)
colnames(MSEtable) <- paste('B = ', nRSloops_vec, sep = '')
rownames(MSEtable) <- paste('K = ', K_vec, sep = '')
options( scipen = 6 )
knitr::kable(MSEtable)
B = 10 B = 120 B = 230 B = 340 B = 450 B = 560 B = 670 B = 780 B = 890 B = 1000
K = 10 1.2155397 0.7616045 0.7447556 0.7352143 0.7280799 0.7218844 0.7196143 0.7173493 0.7126988 0.7104787
K = 120 0.0150656 0.0029115 0.0023595 0.0021759 0.0020534 0.0019846 0.0019494 0.0019144 0.0018901 0.0018665
K = 230 0.0067774 0.0010084 0.0007558 0.0006425 0.0006106 0.0005776 0.0005625 0.0005529 0.0005440 0.0005369
K = 340 0.0044479 0.0006562 0.0004525 0.0003832 0.0003465 0.0003166 0.0003113 0.0003044 0.0002980 0.0002972
K = 450 0.0032917 0.0004495 0.0003090 0.0002659 0.0002391 0.0002240 0.0002182 0.0002125 0.0002078 0.0002052
K = 560 0.0027479 0.0003494 0.0002533 0.0002176 0.0001903 0.0001745 0.0001679 0.0001623 0.0001581 0.0001561
K = 670 0.0022307 0.0002729 0.0002076 0.0001715 0.0001559 0.0001479 0.0001418 0.0001379 0.0001353 0.0001318
K = 780 0.0019095 0.0002638 0.0001825 0.0001569 0.0001397 0.0001354 0.0001276 0.0001242 0.0001205 0.0001200
K = 890 0.0017163 0.0002233 0.0001612 0.0001383 0.0001234 0.0001150 0.0001103 0.0001081 0.0001063 0.0001025
K = 1000 0.0016339 0.0002003 0.0001443 0.0001178 0.0001109 0.0001047 0.0000996 0.0000965 0.0000938 0.0000919

Plotting the computational time.

Note: this is run on the HPC, which runs faster than regular PCs.

Combine bias and time

Can we combine time and bias in one graph? Perhaps we could create a summary measure that penalizes for the time required to estimate the parameters.
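One way to make this concrete is a penalized score. This is a sketch under the assumption that a `timeTable` matrix (average minutes per B-K scenario, same layout as `MSEtable`) is available; both `timeTable` and the penalty weight `lambda` are illustrative.

```r
# Hypothetical trade-off measure: MSE plus a time penalty. 'lambda' is an
# arbitrary weight and 'timeTable' an assumed matrix of average minutes with
# the same B-by-K layout as MSEtable.
lambda <- 0.01
tradeOff <- MSEtable + lambda * timeTable
# Scenario with the best bias/time balance
which(tradeOff == min(tradeOff), arr.ind = TRUE)
```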

Distribution of estimator

The figures below show the QQ-plots of \(\widehat\beta\) associated with model 1 when using 500 samples in one iteration and applying 200 iterations.

for(j in 1:length(nRSloops_vec)){
  K_temp <- K_vec[j]
  B_temp <- nRSloops_vec[j]
  assign(paste0("Plot", j),
    valuesAllSim %>% filter(K == K_temp & nRSloops == B_temp) %>% 
    select(avBeta) %>% unlist(.) %>% 
    as.numeric() %>% gg_qq(x = ., title = paste0('K = ', K_temp, ' and B = ', B_temp))
  )
}

cowplot::plot_grid(Plot1, Plot2, Plot3, Plot4, Plot5, Plot6, labels = c("A", "B", "C", "D", "E", "F"), ncol = 3)
cowplot::plot_grid(Plot7, Plot8, Plot9, Plot10, labels = c("G", "H", "I", "J"), ncol = 3)

Empirical CI coverage

Within iteration sandwich estimator

In this section, we calculate the 95% confidence interval for every iteration (B). Within each iteration, we can check its nominal coverage level and then average over all simulations. Note that this is simply equivalent to running multiple simulations in which we check the asymptotic properties of the sandwich estimator; it is not related to our research question!

# Empty data frame
valuesAvgCov <- data.frame() %>% tbl_df()

# load in estimated beta values and calculate coverage
for(i in 1:nsim){
  # Values in one simulation: estimated beta values averaged within the sampling algorithm
  ValSim <-  try(read.table(file = paste0(datLoc, 'uni_beta_vector_simID_', i, '_SCEN_', SCEN, '.txt'),
                 col.names = c("beta", "sVariance", "K", "nRSloops", "TrueBeta"), header = TRUE) %>% 
                   tbl_df(), silent = TRUE)
  if(inherits(ValSim, "try-error")){
    print(i);next
  }
  # Calculate coverage per simulation: estimate +/- (1.96 * sqrt(sandwichVariance)) for 95% CI
  WithCoverage <- mutate(ValSim, CIlow = beta - (qnorm(p = ((1 - level)/2), lower.tail = FALSE) * sqrt(sVariance))) %>%
              mutate(CIup = beta + (qnorm(p = ((1 - level)/2), lower.tail = FALSE) * sqrt(sVariance))) %>% rowwise() %>%
              mutate(coverage = ifelse(TrueBeta >= CIlow && TrueBeta <= CIup, 1, 0))
      
  # Average coverage per simulation
  AvgCov <- WithCoverage %>% ungroup() %>% group_by(K, nRSloops) %>%
      summarise(AvgCov = mean(coverage, na.rm = TRUE)) %>% 
      mutate(sim = i)
  
  # Add to data frame with all simulations
  valuesAvgCov <- bind_rows(valuesAvgCov, AvgCov)
}

Now we proceed to plot the average CI level per B and K.

For more detail, we refer to the table below with the average empirical coverage per B (columns) and K (rows).

B = 10 B = 120 B = 230 B = 340 B = 450 B = 560 B = 670 B = 780 B = 890 B = 1000
K = 10 0.811 0.778 0.760 0.760 0.767 0.768 0.769 0.761 0.763 0.763
K = 120 0.940 0.938 0.936 0.936 0.936 0.936 0.935 0.935 0.934 0.934
K = 230 0.960 0.942 0.942 0.940 0.941 0.941 0.941 0.941 0.941 0.940
K = 340 0.936 0.945 0.941 0.944 0.944 0.944 0.943 0.944 0.944 0.943
K = 450 0.940 0.948 0.948 0.949 0.945 0.945 0.944 0.945 0.945 0.946
K = 560 0.952 0.949 0.944 0.946 0.947 0.948 0.948 0.947 0.946 0.946
K = 670 0.940 0.940 0.946 0.946 0.946 0.946 0.946 0.946 0.946 0.947
K = 780 0.944 0.947 0.948 0.950 0.949 0.950 0.950 0.949 0.949 0.949
K = 890 0.960 0.947 0.950 0.948 0.949 0.949 0.949 0.948 0.950 0.949
K = 1000 0.956 0.953 0.952 0.953 0.951 0.950 0.951 0.951 0.951 0.950
Bias-corrected bootstrap-based CI

Now we turn to the more interesting question: how can we create confidence intervals around the average \(\beta\) PIM estimate obtained in one iteration of the non-optimal resampling procedure? We suggest using a bootstrap procedure. Let us try the bias-corrected and accelerated bootstrap (BCa), which adjusts for both bias and skewness. Note that this might introduce extra computation time. To get the results here, we run all simulations in parallel. Note that we defined the function to calculate the coverages (BootParCI) earlier.

# Detect and start the workers
P <- detectCores(logical = FALSE) # physical cores
cl <- makeCluster(P)

# Initialize them with the libraries
clusterEvalQ(cl, library(tidyverse))
clusterEvalQ(cl, library(boot))

# Run in parallel and append results
boot_par_results <- clusterApply(cl, 1:nsim, fun = BootParCI, datLoc = datLoc, SCEN = SCEN)
boot_CI_coverage <- do.call(rbind, boot_par_results)

# Stop the workers
stopCluster(cl)
B = 10 B = 120 B = 230 B = 340 B = 450 B = 560 B = 670 B = 780 B = 890 B = 1000
K = 10 0.639 0.001 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
K = 120 0.877 0.724 0.559 0.434 0.339 0.261 0.194 0.146 0.125 0.088
K = 230 0.896 0.844 0.746 0.675 0.583 0.512 0.458 0.395 0.361 0.322
K = 340 0.897 0.835 0.783 0.714 0.667 0.616 0.556 0.524 0.475 0.424
K = 450 0.908 0.871 0.818 0.744 0.692 0.658 0.613 0.578 0.553 0.518
K = 560 0.886 0.875 0.814 0.754 0.718 0.674 0.659 0.627 0.597 0.572
K = 670 0.895 0.890 0.802 0.766 0.721 0.671 0.643 0.619 0.595 0.564
K = 780 0.897 0.863 0.812 0.761 0.722 0.662 0.648 0.609 0.586 0.567
K = 890 0.886 0.861 0.808 0.746 0.709 0.663 0.640 0.611 0.584 0.568
K = 1000 0.869 0.864 0.814 0.763 0.715 0.676 0.651 0.622 0.588 0.558

Quite bad coverages!

Student-t confidence intervals

What happens if we consider the PIM \(\beta\) estimates as K i.i.d. data points and construct a \(t\)-based CI? Using the standard deviation of the estimated \(\beta\) parameters, we get:

\[ [\bar{\beta} \pm t_{\alpha/2, K-1} \, sd(\beta)/\sqrt{K - 1}] \] Note that this is not entirely correct, but we just want to see what effect it has on the empirical coverage.

norm_tAvg <- c()
# Read in data (again, sigh)
for(i in 1:nsim){
  ValSim <- read.table(file = paste0(datLoc, 'uni_beta_vector_simID_', i, '_SCEN_', SCEN, '.txt'),
         col.names = c("beta", "sVariance", "K", "nRSloops", "TrueBeta"), header = TRUE) %>% 
        tbl_df()
  # CI using SD of data and normal quantiles and using t-quantiles
    # Formula = sd/sqrt(n) for some strange reason (I expected sd of beta being equal to se)
  norm_tmp <- ValSim %>% group_by(K, nRSloops) %>% 
          mutate(
          CIlow = beta - (qnorm(0.025, lower.tail = FALSE) * (sd(beta)/(sqrt(K - 1)))),
          CIup = beta + (qnorm(0.025, lower.tail = FALSE) * (sd(beta)/(sqrt(K - 1)))),
          type = 'norm') %>% group_by(K, nRSloops, TrueBeta, type) %>%
    summarise(AvgBeta = mean(beta, na.rm = TRUE),
          AvgLow = mean(CIlow, na.rm = TRUE), 
          AvgUp = mean(CIup, na.rm = TRUE)) %>% ungroup() %>% rowwise() %>% 
    mutate(coverage_ind = ifelse(TrueBeta >= AvgLow && TrueBeta <= AvgUp, 1, 0),
          sim = i)
  t_tmp <- ValSim %>% group_by(K, nRSloops) %>% 
          mutate(
          CIlow = beta - (qt(0.025, df = (K - 1), lower.tail = FALSE) * (sd(beta)/(sqrt(K - 1)))),
          CIup = beta + (qt(0.025, df = (K - 1), lower.tail = FALSE) * (sd(beta)/(sqrt(K - 1)))),
          type = 't') %>% group_by(K, nRSloops, TrueBeta, type) %>%
    summarise(AvgBeta = mean(beta, na.rm = TRUE),
          AvgLow = mean(CIlow, na.rm = TRUE), 
          AvgUp = mean(CIup, na.rm = TRUE)) %>% ungroup() %>% rowwise() %>% 
    mutate(coverage_ind = ifelse(TrueBeta >= AvgLow && TrueBeta <= AvgUp, 1, 0),
          sim = i)
  
  # Construct general CI by averaging endpoints
  norm_tAvg <- bind_rows(norm_tAvg,norm_tmp, t_tmp)
}
# Plot
norm_tAvg %>% filter(type == 't' & K == 1000 & nRSloops == 1000) %>% 
  slice(round(seq(1,nsim,length.out = 50),0)) %>%
   ggplot(., aes(x = TrueBeta)) + geom_vline(aes(xintercept = TrueBeta)) +
   geom_point(aes(x = AvgBeta, y = sim)) + 
   geom_segment(aes(x = AvgLow, xend = AvgUp, y = sim, yend = sim)) + 
   scale_x_continuous("beta") + scale_y_continuous("simulation") +
   ggtitle("95% CI around beta PIM estimate")

resTable_tmp <- norm_tAvg %>% filter(type == 't') %>% group_by(K,nRSloops) %>% summarise(EC = mean(coverage_ind))
resTable <- matrix(resTable_tmp$EC, ncol = length(nRSloops_vec), byrow = TRUE)
colnames(resTable) <- paste('B = ', nRSloops_vec, sep = '')
rownames(resTable) <- paste('K = ', K_vec, sep = '')
knitr::kable(resTable)
B = 10 B = 120 B = 230 B = 340 B = 450 B = 560 B = 670 B = 780 B = 890 B = 1000
K = 10 0.908 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
K = 120 0.381 0.742 0.829 0.863 0.897 0.915 0.934 0.937 0.946 0.956
K = 230 0.309 0.686 0.748 0.804 0.828 0.850 0.869 0.870 0.872 0.873
K = 340 0.269 0.603 0.681 0.720 0.750 0.781 0.788 0.790 0.792 0.809
K = 450 0.240 0.567 0.651 0.674 0.707 0.729 0.735 0.738 0.731 0.739
K = 560 0.196 0.509 0.591 0.618 0.674 0.674 0.708 0.714 0.709 0.724
K = 670 0.190 0.489 0.562 0.603 0.625 0.629 0.648 0.659 0.658 0.676
K = 780 0.159 0.452 0.532 0.576 0.582 0.596 0.606 0.612 0.611 0.621
K = 890 0.158 0.422 0.485 0.507 0.543 0.548 0.566 0.583 0.579 0.593
K = 1000 0.138 0.395 0.467 0.530 0.527 0.534 0.548 0.562 0.569 0.564
Bootstrap m-out-of-n

Essentially, the non-optimal subsampling algorithm can also be considered an m-out-of-n bootstrap, with \(m \ll n\). In our case, m corresponds to K, the subsample size.

However, in the bootstrap algorithm, it is essential to scale the CI in order to match the variability of the full dataset. This scaling is done by multiplying the standard error of the estimator (\(se(\hat\theta_n)\)) as follows:

\[ se(\hat\theta_n) \approx se(\theta^*_K) \times \sqrt{\frac{K}{n}} \] where \(se(\theta^*_K)\) equals the standard deviation of the sampling distribution across the B iterations.

We shall try this here.

scaledSD <- data.frame() %>% tbl_df()
# Read in data
for(i in 1:nsim){
  ValSim <- read.table(file = paste0(datLoc, 'uni_beta_vector_simID_', i, '_SCEN_', SCEN, '.txt'),
         col.names = c("beta", "sVariance", "K", "nRSloops", "TrueBeta"), header = TRUE) %>% 
        tbl_df()
  # CI using SD of beta with scaling
  sdScale_tmp <- ValSim %>% group_by(K, nRSloops, TrueBeta) %>% 
          summarise(AvgBeta = mean(beta, na.rm = TRUE),  
                    sdBeta = sd(beta, na.rm = TRUE)) %>%
          mutate(CIlow =  AvgBeta - (qnorm(0.025, lower.tail = FALSE) * sdBeta * sqrt(K/n)),
              CIup = AvgBeta + (qnorm(0.025, lower.tail = FALSE) * sdBeta * sqrt(K/n)),
              type = 'scaledSD') %>% 
          ungroup() %>% rowwise() %>% 
          mutate(coverage_ind = ifelse(TrueBeta >= CIlow && TrueBeta <= CIup, 1, 0),
              sim = i)
  
  # Collect the values over all simulations
  scaledSD <- bind_rows(scaledSD,sdScale_tmp)
}

scaledSDTable_tmp <- scaledSD %>% group_by(K,nRSloops) %>% summarise(EC = mean(coverage_ind))
scaledSDTable <- matrix(scaledSDTable_tmp$EC, ncol = length(nRSloops_vec), byrow = TRUE)
colnames(scaledSDTable) <- paste('B = ', nRSloops_vec, sep = '')
rownames(scaledSDTable) <- paste('K = ', K_vec, sep = '')
knitr::kable(scaledSDTable)
B = 10 B = 120 B = 230 B = 340 B = 450 B = 560 B = 670 B = 780 B = 890 B = 1000
K = 10 0.012 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
K = 120 0.092 0.172 0.145 0.113 0.091 0.065 0.054 0.041 0.035 0.029
K = 230 0.128 0.334 0.363 0.353 0.348 0.340 0.340 0.334 0.313 0.307
K = 340 0.185 0.410 0.474 0.511 0.527 0.544 0.537 0.544 0.538 0.532
K = 450 0.220 0.512 0.588 0.624 0.640 0.660 0.676 0.676 0.675 0.676
K = 560 0.228 0.551 0.634 0.686 0.725 0.731 0.750 0.757 0.765 0.774
K = 670 0.237 0.624 0.690 0.738 0.764 0.777 0.786 0.797 0.809 0.812
K = 780 0.250 0.644 0.738 0.769 0.786 0.797 0.810 0.815 0.824 0.826
K = 890 0.269 0.660 0.778 0.796 0.816 0.832 0.847 0.852 0.862 0.867
K = 1000 0.268 0.694 0.796 0.835 0.859 0.865 0.874 0.878 0.886 0.893
Scaled sandwich variance
scaledSVAR <- data.frame() %>% tbl_df()
# Read in data
for(i in 1:nsim){
  ValSim <- read.table(file = paste0(datLoc, 'uni_beta_vector_simID_', i, '_SCEN_', SCEN, '.txt'),
         col.names = c("beta", "sVariance", "K", "nRSloops", "TrueBeta"), header = TRUE) %>% 
        tbl_df()
  # CI using mean sandwich variance of beta with scaling
  svarScale_tmp <- ValSim %>% group_by(K, nRSloops, TrueBeta) %>% 
          summarise(AvgBeta = mean(beta, na.rm = TRUE),  
                    AvgSvarBeta = mean(sVariance, na.rm = TRUE)) %>%
          mutate(CIlow =  AvgBeta - (qnorm(0.025, lower.tail = FALSE) *
                                       sqrt(AvgSvarBeta) * sqrt(K/n)),
              CIup = AvgBeta + (qnorm(0.025, lower.tail = FALSE) * 
                                  sqrt(AvgSvarBeta) * sqrt(K/n)),
              type = 'scaledSVAR') %>% 
          ungroup() %>% rowwise() %>% 
          mutate(coverage_ind = ifelse(TrueBeta >= CIlow && TrueBeta <= CIup, 1, 0),
              sim = i)
  
  # Collect the values over all simulations
  scaledSVAR <- bind_rows(scaledSVAR,svarScale_tmp)
}

scaledSVARtable_tmp <- scaledSVAR %>% group_by(K,nRSloops) %>% summarise(EC = mean(coverage_ind))
scaledSVARtable <- matrix(scaledSVARtable_tmp$EC, ncol = length(nRSloops_vec), byrow = TRUE)
colnames(scaledSVARtable) <- paste('B = ', nRSloops_vec, sep = '')
rownames(scaledSVARtable) <- paste('K = ', K_vec, sep = '')
knitr::kable(scaledSVARtable)
B = 10 B = 120 B = 230 B = 340 B = 450 B = 560 B = 670 B = 780 B = 890 B = 1000
K = 10 0.006 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
K = 120 0.092 0.164 0.142 0.106 0.088 0.061 0.047 0.040 0.034 0.026
K = 230 0.131 0.334 0.352 0.345 0.338 0.332 0.326 0.324 0.304 0.303
K = 340 0.191 0.407 0.465 0.502 0.516 0.533 0.526 0.532 0.526 0.523
K = 450 0.218 0.519 0.584 0.623 0.628 0.656 0.668 0.674 0.668 0.670
K = 560 0.230 0.550 0.630 0.679 0.727 0.721 0.753 0.750 0.761 0.773
K = 670 0.254 0.617 0.689 0.734 0.760 0.771 0.782 0.792 0.809 0.813
K = 780 0.270 0.639 0.738 0.766 0.784 0.801 0.810 0.809 0.824 0.828
K = 890 0.281 0.666 0.773 0.801 0.820 0.832 0.844 0.855 0.860 0.866
K = 1000 0.273 0.695 0.800 0.826 0.858 0.867 0.871 0.873 0.886 0.887
Sandwich variance estimator CI

After talking with Jan, it should really be:

\[ \left[ \widehat{\beta}_{final} \pm z_{1-\alpha/2} \times \sqrt{\frac{1}{B^2} \sum_{i=1}^B \widehat{\text{Var}}(\widehat\beta_i)} \right] \] since we have B parts of K data points each. The final estimate \(\widehat{\beta}_{final}\) is the mean of the B estimates, so we sum and divide by \(B\); the variance of a constant times a sum equals the squared constant times the sum of the variances, provided the B parts are independent!
However, in our subsampling scheme, this is not the case! So I expect somewhat worse coverage here, but better coverage in the case of the m-out-of-n bootstrap procedure.
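The variance argument can be checked numerically for truly independent parts. This toy check uses simulated independent estimates, not the PIM output:

```r
# Numerical check: for B independent estimates with variance v each,
# Var(mean) = (1/B^2) * B * v = v / B
set.seed(1)
B <- 1000
v <- 4
betas <- matrix(rnorm(B * 10000, sd = sqrt(v)), nrow = B)
var(colMeans(betas))  # empirical variance of the averaged estimate
v / B                 # theoretical value: 0.004
```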

SummedSvar <- data.frame() %>% tbl_df()
# Read in data
for(i in 1:nsim){
  ValSim <- read.table(file = paste0(datLoc, 'uni_beta_vector_simID_', i, '_SCEN_', SCEN, '.txt'),
         col.names = c("beta", "sVariance", "K", "nRSloops", "TrueBeta"), header = TRUE) %>% 
        tbl_df()
  # CI through now summing sandwich variance of beta
  SummedSvar_tmp <- ValSim %>% group_by(K, nRSloops, TrueBeta) %>% 
          summarise(AvgBeta = mean(beta, na.rm = TRUE),  
                    SummedSvar = sum(sVariance, na.rm = TRUE)) %>%
          mutate(bVar = SummedSvar * (1/nRSloops^2),
              CIlow =  AvgBeta - (qnorm(0.025, lower.tail = FALSE) *
                                       sqrt(bVar)),
              CIup = AvgBeta + (qnorm(0.025, lower.tail = FALSE) * 
                                  sqrt(bVar)),
              type = 'avSVar',  sim = i,
              coverage_ind = 
                ifelse(TrueBeta >= CIlow && TrueBeta <= CIup, 1, 0))
  
  # Collect the values over all simulations
  SummedSvar <- bind_rows(SummedSvar,SummedSvar_tmp)
}

SummedSvartable_tmp <- SummedSvar %>% group_by(K,nRSloops) %>% summarise(EC = mean(coverage_ind))
SummedSvartable <- matrix(SummedSvartable_tmp$EC, ncol = length(nRSloops_vec), byrow = TRUE)
colnames(SummedSvartable) <- paste('B = ', nRSloops_vec, sep = '')
rownames(SummedSvartable) <- paste('K = ', K_vec, sep = '')
knitr::kable(SummedSvartable)
B = 10 B = 120 B = 230 B = 340 B = 450 B = 560 B = 670 B = 780 B = 890 B = 1000
K = 10 0.458 0.001 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
K = 120 0.923 0.692 0.531 0.403 0.304 0.233 0.178 0.137 0.107 0.080
K = 230 0.934 0.833 0.731 0.661 0.568 0.493 0.437 0.389 0.344 0.315
K = 340 0.945 0.852 0.774 0.703 0.653 0.615 0.546 0.513 0.470 0.422
K = 450 0.951 0.876 0.807 0.745 0.695 0.654 0.621 0.567 0.539 0.513
K = 560 0.940 0.882 0.815 0.751 0.724 0.671 0.662 0.614 0.590 0.564
K = 670 0.942 0.888 0.810 0.761 0.716 0.673 0.637 0.621 0.591 0.560
K = 780 0.939 0.865 0.804 0.752 0.720 0.668 0.638 0.610 0.584 0.575
K = 890 0.950 0.864 0.809 0.743 0.702 0.659 0.648 0.617 0.580 0.568
K = 1000 0.935 0.880 0.815 0.764 0.714 0.678 0.642 0.614 0.581 0.558

Model 2

Parameters:

u <- 10
alpha <- 10
sigma <- 5
trueBeta <- alpha/(sqrt(2) * sigma)
# Model (called SCEN in script)
SCEN <- 2

Load in data

Start with estimated \(\beta\):

# load in estimated beta values
for(i in 1:nsim){
  if(i %in% progress) print(paste0("At ", i/nsim*100, "%"))
  # Values in one simulation: estimated beta values averaged within the sampling algorithm
  ValSim <-  try(read.table(file = paste0(datLoc, 'uni_beta_vector_simID_', i, '_SCEN_', SCEN, '.txt'),
               col.names = c("beta", "sVariance", "K", "nRSloops", "TrueBeta"), 
               header = TRUE) %>% tbl_df() %>%
      group_by(K, nRSloops) %>% summarise(avBeta = mean(beta)) %>% 
      mutate(sim = i), silent = TRUE)
  if(inherits(ValSim, "try-error")){
    print(i);next
  }
  
  # Add to data frame with all simulations
  valuesAllSim <- bind_rows(valuesAllSim, ValSim)
}

Same for computational time:

# load in computational time
for(i in 1:nsim){
  if(i %in% progress) print(paste0("At ", i/nsim*100, "%"))
  # Values in one simulation: computational time 
  TimeSim <-  try(read.table(file = paste0(datLoc, 'uni_time_vector_simID_', i, '_SCEN_', SCEN, '.txt'),
              col.names = c("minutes"), header = TRUE) %>% tbl_df() %>%
              bind_cols(., pairs) %>% 
              mutate(sim = i), silent = TRUE)
    if(inherits(TimeSim, "try-error")){
    print(i);next
  }
  
  # Add to data frame with all simulations
  timeAllSim <- bind_rows(timeAllSim, TimeSim)
}

Bias vs computational time

Quick summary:

tbl_df(valuesAllSim)
# A tibble: 49,900 × 4
       K nRSloops   avBeta   sim
   <int>    <int>    <dbl> <int>
1     10       10 1.037945     1
2     10      120 1.173419     1
3     10      230 1.113818     1
4     10      340 1.119661     1
5     10      450 1.109486     1
6     10      560 1.109550     1
7     10      670 1.107620     1
8     10      780 1.086124     1
9     10      890 1.089851     1
10    10     1000 1.090058     1
# ... with 49,890 more rows
group_by(valuesAllSim, K, nRSloops) %>% summarise(avBeta = mean(avBeta)) %>% summary()
       K           nRSloops        avBeta      
 Min.   :  10   Min.   :  10   Min.   :0.8400  
 1st Qu.: 230   1st Qu.: 230   1st Qu.:0.8407  
 Median : 505   Median : 505   Median :0.8424  
 Mean   : 505   Mean   : 505   Mean   :0.8693  
 3rd Qu.: 780   3rd Qu.: 780   3rd Qu.:0.8478  
 Max.   :1000   Max.   :1000   Max.   :1.1006  

Plotting the estimated \(\beta\) parameters for each simulation, faceted by the number of iterations (B).

Plotting the computational time.

These data generating parameters do not work!

Model 3

Parameters:

alpha_1 <- -0.43
# Sigma based on dataset
sigma <- 9.51
trueBeta <- alpha_1/(sqrt(2) * sigma)
# Model (called SCEN in script)
SCEN <- 3

Load in data

Start with estimated \(\beta\):

S3_colnames <- c("beta.X_smartph_hrs", "beta.X_sex", 
                 "beta.X_econArea",  "beta.X_ethnic", 
                 "sVariance.X_smartph_hrs", "sVariance.X_sex",
                 "sVariance.X_econArea", 
                 "sVariance.X_ethnic",  "K", "nRSloops", "TrueBeta")

# load in estimated beta values
for(i in 1:nsim){
  if(i %in% progress) print(paste0("At ", i/nsim*100, "%"))
  # Values in one simulation: estimated beta values averaged within the sampling algorithm
  ValSim <-  try(read.table(file = paste0(datLoc, 'uni_beta_vector_simID_', i, '_SCEN_3.txt'),
                        col.names = S3_colnames, header = TRUE) %>% tbl_df()  %>%
        group_by(K, nRSloops) %>% summarise_all(mean) %>%
        mutate(sim = i), silent = TRUE)
  if(inherits(ValSim, "try-error")){
    print(i);next
  }
  
  # Add to data frame with all simulations
  valuesAllSim <- bind_rows(valuesAllSim, ValSim)
}

Same for computational time:

# load in computational time
for(i in 1:nsim){
  if(i %in% progress) print(paste0("At ", i/nsim*100, "%"))
  # Values in one simulation: computational time 
  TimeSim <-  read.table(file = paste0(datLoc, 'uni_time_vector_simID_', i, '_SCEN_', SCEN, '.txt'),
              col.names = c("minutes"), header = TRUE) %>% tbl_df() %>%
              bind_cols(., pairs) %>% 
              mutate(sim = i)
  
  # Add to data frame with all simulations
  timeAllSim <- bind_rows(timeAllSim, TimeSim)
}

Bias vs computational time

Quick summary:

group_by(valuesAllSim, K, nRSloops) %>% summarise(avBeta = mean(beta.X_smartph_hrs)) %>% summary()
       K           nRSloops        avBeta        
 Min.   :  10   Min.   :  10   Min.   :-0.05688  
 1st Qu.: 230   1st Qu.: 230   1st Qu.:-0.03234  
 Median : 505   Median : 505   Median :-0.03220  
 Mean   : 505   Mean   : 505   Mean   :-0.03465  
 3rd Qu.: 780   3rd Qu.: 780   3rd Qu.:-0.03212  
 Max.   :1000   Max.   :1000   Max.   :-0.03193  

Plotting the estimated \(\beta\) parameters for each simulation, faceted by the number of iterations (B).

MSEtable_tmp <- group_by(valuesAllSim, K, nRSloops) %>% mutate(SE = (beta.X_smartph_hrs - TrueBeta)^2) %>%
  summarise(MSE = mean(SE, na.rm = TRUE))
MSEtable <- matrix(MSEtable_tmp$MSE, ncol = length(nRSloops_vec), byrow = TRUE)
colnames(MSEtable) <- paste('B = ', nRSloops_vec, sep = '')
rownames(MSEtable) <- paste('K = ', K_vec, sep = '')
options( scipen = 6 )
knitr::kable(MSEtable)
B = 10 B = 120 B = 230 B = 340 B = 450 B = 560 B = 670 B = 780 B = 890 B = 1000
K = 10 0.0093666 0.0013358 0.0010234 0.0008958 0.0008330 0.0007854 0.0007513 0.0007282 0.0007313 0.0007220
K = 120 0.0000887 0.0000088 0.0000051 0.0000037 0.0000030 0.0000027 0.0000024 0.0000022 0.0000021 0.0000020
K = 230 0.0000474 0.0000045 0.0000025 0.0000019 0.0000016 0.0000013 0.0000012 0.0000011 0.0000010 0.0000010
K = 340 0.0000312 0.0000027 0.0000018 0.0000014 0.0000012 0.0000010 0.0000010 0.0000009 0.0000009 0.0000009
K = 450 0.0000243 0.0000024 0.0000015 0.0000012 0.0000010 0.0000009 0.0000009 0.0000008 0.0000008 0.0000007
K = 560 0.0000182 0.0000020 0.0000012 0.0000009 0.0000009 0.0000008 0.0000008 0.0000007 0.0000007 0.0000006
K = 670 0.0000154 0.0000018 0.0000012 0.0000010 0.0000009 0.0000008 0.0000007 0.0000007 0.0000006 0.0000006
K = 780 0.0000136 0.0000015 0.0000010 0.0000008 0.0000007 0.0000006 0.0000006 0.0000006 0.0000006 0.0000006
K = 890 0.0000123 0.0000012 0.0000009 0.0000007 0.0000007 0.0000006 0.0000006 0.0000006 0.0000006 0.0000006
K = 1000 0.0000105 0.0000012 0.0000008 0.0000007 0.0000006 0.0000006 0.0000006 0.0000005 0.0000005 0.0000005
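The decay of the MSE with K and B can also be visualised directly; a minimal sketch, assuming the MSEtable_tmp object computed above is available in the session:

```r
# Sketch: MSE against subsample size K, one line per number of
# iterations B (nRSloops), on a log scale to show the fast decay.
library(ggplot2)
ggplot(MSEtable_tmp, aes(x = K, y = MSE, colour = factor(nRSloops))) +
  geom_line() +
  scale_y_log10("MSE (log10 scale)") +
  labs(colour = "B")
```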

Plotting the computational time.

Distribution of estimator

for(j in 1:length(nRSloops_vec)){
  K_temp <- K_vec[j]
  B_temp <- nRSloops_vec[j]
  assign(paste0("Plot", j),
    valuesAllSim %>% filter(K == K_temp & nRSloops == B_temp) %>% 
    select(beta.X_smartph_hrs) %>% unlist(.) %>% 
    as.numeric() %>% gg_qq(x = ., title = paste0('K = ', K_temp, ' and B = ', B_temp))
  )
}

cowplot::plot_grid(Plot1, Plot2, Plot3, Plot4, Plot5, Plot6, labels = c("A", "B", "C", "D", "E", "F"), ncol = 3)
cowplot::plot_grid(Plot7, Plot8, Plot9, Plot10, labels = c("G", "H", "I", "J"), ncol = 3)
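The gg_qq() helper used above is defined elsewhere in the project scripts; a minimal sketch of such a normal QQ-plot helper (name, arguments, and defaults assumed) could look like:

```r
library(ggplot2)
# Hypothetical sketch of a gg_qq()-style helper: normal QQ-plot of a
# numeric vector with a reference line and a title.
gg_qq <- function(x, title = "") {
  ggplot(data.frame(x = x), aes(sample = x)) +
    stat_qq() +
    stat_qq_line(colour = "red") +
    ggtitle(title)
}
```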

Empirical CI coverage

Bias-corrected bootstrap-based CI

Same procedure as model 1:

# Detect and start the workers
P <- detectCores(logical = FALSE) # physical cores
cl <- makeCluster(P)

# Initialize them with the libraries
clusterEvalQ(cl, library(tidyverse))
clusterEvalQ(cl, library(boot))

# Run in parallel and append results
boot_par_results <- clusterApply(cl, 1:nsim, fun = BootParCI, datLoc = datLoc, SCEN = SCEN)
boot_CI_coverage <- do.call(rbind, boot_par_results)

# Stop the workers
stopCluster(cl)
B = 10 B = 120 B = 230 B = 340 B = 450 B = 560 B = 670 B = 780 B = 890 B = 1000
K = 10 0.856 0.841 0.737 0.647 0.569 0.509 0.451 0.393 0.320 0.285
K = 120 0.905 0.928 0.916 0.898 0.886 0.872 0.850 0.834 0.822 0.806
K = 230 0.899 0.936 0.921 0.900 0.897 0.874 0.881 0.855 0.835 0.819
K = 340 0.905 0.934 0.906 0.875 0.854 0.855 0.817 0.785 0.772 0.762
K = 450 0.894 0.922 0.879 0.860 0.824 0.804 0.780 0.766 0.745 0.727
K = 560 0.901 0.913 0.893 0.850 0.815 0.803 0.775 0.750 0.722 0.723
K = 670 0.901 0.912 0.870 0.802 0.779 0.751 0.729 0.707 0.689 0.673
K = 780 0.891 0.910 0.855 0.827 0.788 0.766 0.745 0.705 0.667 0.636
K = 890 0.878 0.918 0.863 0.808 0.764 0.734 0.686 0.671 0.639 0.616
K = 1000 0.886 0.876 0.851 0.776 0.733 0.704 0.693 0.666 0.633 0.614

Coverage is somewhat better.
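BootParCI() is defined in a separate script; a hypothetical sketch of its core step for one simulation, using the boot package to build a bias-corrected and accelerated (BCa) interval around the averaged \(\beta\) (function and argument names are assumptions, not the project's actual implementation):

```r
library(boot)
# Hypothetical sketch: BCa bootstrap CI around the mean of the K
# subsample estimates, plus a coverage indicator for the true beta.
BootParCI_sketch <- function(betas, trueBeta, R = 999) {
  bt <- boot(betas, statistic = function(d, idx) mean(d[idx]), R = R)
  ci <- boot.ci(bt, type = "bca")$bca[4:5]
  data.frame(CIlow = ci[1], CIup = ci[2],
             coverage_ind = as.integer(trueBeta >= ci[1] & trueBeta <= ci[2]))
}
```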

Student-t confidence intervals
norm_tAvg <- c()
# Read in data (again, sigh)
for(i in 1:nsim){
  ValSim <- read.table(file = paste0(datLoc, 'uni_beta_vector_simID_', i, '_SCEN_', SCEN, '.txt'),
         col.names = S3_colnames, header = TRUE) %>% 
        tbl_df()
  ValSim <- rename(ValSim, beta = beta.X_smartph_hrs)
  # CI using the SD of the estimates, once with normal quantiles and once with t-quantiles
    # Formula = sd/sqrt(K - 1), oddly (one would expect the SD of beta itself to equal the SE)
  norm_tmp <- ValSim %>% group_by(K, nRSloops) %>% 
          mutate(
          CIlow = beta - (qnorm(0.025, lower.tail = FALSE) * (sd(beta)/(sqrt(K - 1)))),
          CIup = beta + (qnorm(0.025, lower.tail = FALSE) * (sd(beta)/(sqrt(K - 1)))),
          type = 'norm') %>% group_by(K, nRSloops, TrueBeta, type) %>%
    summarise(AvgBeta = mean(beta, na.rm = TRUE),
          AvgLow = mean(CIlow, na.rm = TRUE), 
          AvgUp = mean(CIup, na.rm = TRUE)) %>% ungroup() %>% rowwise() %>% 
    mutate(coverage_ind = ifelse(TrueBeta >= AvgLow && TrueBeta <= AvgUp, 1, 0),
          sim = i)
  t_tmp <- ValSim %>% group_by(K, nRSloops) %>% 
          mutate(
          CIlow = beta - (qt(0.025, df = (K - 1), lower.tail = FALSE) * (sd(beta)/(sqrt(K - 1)))),
          CIup = beta + (qt(0.025, df = (K - 1), lower.tail = FALSE) * (sd(beta)/(sqrt(K - 1)))),
          type = 't') %>% group_by(K, nRSloops, TrueBeta, type) %>%
    summarise(AvgBeta = mean(beta, na.rm = TRUE),
          AvgLow = mean(CIlow, na.rm = TRUE), 
          AvgUp = mean(CIup, na.rm = TRUE)) %>% ungroup() %>% rowwise() %>% 
    mutate(coverage_ind = ifelse(TrueBeta >= AvgLow && TrueBeta <= AvgUp, 1, 0),
          sim = i)
  
  # Construct general CI by averaging endpoints
  norm_tAvg <- bind_rows(norm_tAvg,norm_tmp, t_tmp)
}
# Plot
norm_tAvg %>% filter(type == 't' & K == 120 & nRSloops == 120) %>% 
  slice(round(seq(1,nsim,length.out = 50),0)) %>%
   ggplot(., aes(x = TrueBeta)) + geom_vline(aes(xintercept = TrueBeta)) +
   geom_point(aes(x = AvgBeta, y = sim)) + 
   geom_segment(aes(x = AvgLow, xend = AvgUp, y = sim, yend = sim)) + 
   scale_x_continuous("beta") + scale_y_continuous("simulation") +
   ggtitle("95% CI around beta PIM estimate")

resTable_tmp <- norm_tAvg %>% filter(type == 't') %>% group_by(K,nRSloops) %>% summarise(EC = mean(coverage_ind))
resTable <- matrix(resTable_tmp$EC, ncol = length(nRSloops_vec), byrow = TRUE)
colnames(resTable) <- paste('B = ', nRSloops_vec, sep = '')
rownames(resTable) <- paste('K = ', K_vec, sep = '')
knitr::kable(resTable)
B = 10 B = 120 B = 230 B = 340 B = 450 B = 560 B = 670 B = 780 B = 890 B = 1000
K = 10 0.959 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
K = 120 0.439 0.934 0.985 0.996 1.000 0.999 1.000 1.000 1.000 1.000
K = 230 0.312 0.804 0.920 0.959 0.979 0.985 0.991 0.995 0.992 0.995
K = 340 0.244 0.738 0.839 0.881 0.919 0.936 0.946 0.955 0.960 0.953
K = 450 0.214 0.629 0.759 0.806 0.822 0.855 0.876 0.879 0.890 0.900
K = 560 0.211 0.571 0.693 0.756 0.774 0.802 0.807 0.821 0.832 0.840
K = 670 0.178 0.503 0.627 0.665 0.683 0.721 0.732 0.753 0.764 0.766
K = 780 0.174 0.493 0.600 0.634 0.660 0.678 0.703 0.706 0.702 0.713
K = 890 0.137 0.480 0.544 0.582 0.618 0.628 0.627 0.643 0.638 0.636
K = 1000 0.170 0.427 0.507 0.522 0.542 0.575 0.597 0.601 0.600 0.604
Bootstrap m-out-of-n

Again, scale the CI according to bootstrap m-out-of-n theory.
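As a toy illustration of the rescaling (all values below are illustrative, not taken from the simulations): the SD of the subsample estimates is shrunk by \(\sqrt{K/n}\) to approximate the full-sample standard error.

```r
# Toy sketch of the m-out-of-n rescaling used below (values illustrative).
set.seed(1)
n <- 250000; K <- 500
betas <- rnorm(100, mean = 1.41, sd = 0.05)   # stand-in subsample estimates
z <- qnorm(0.025, lower.tail = FALSE)
CIlow <- mean(betas) - z * sd(betas) * sqrt(K / n)
CIup  <- mean(betas) + z * sd(betas) * sqrt(K / n)
c(CIlow, CIup)
```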

scaledSD <- data.frame() %>% tbl_df()
# Read in data (again, sigh)
for(i in 1:nsim){
  ValSim <- read.table(file = paste0(datLoc, 'uni_beta_vector_simID_', i, '_SCEN_', SCEN, '.txt'),
         col.names = S3_colnames, header = TRUE) %>% 
        tbl_df()
  ValSim <- rename(ValSim, beta = beta.X_smartph_hrs)
  # CI using SD of beta with scaling
  sdScale_tmp <- ValSim %>% group_by(K, nRSloops, TrueBeta) %>% 
          summarise(AvgBeta = mean(beta, na.rm = TRUE),  
                    sdBeta = sd(beta, na.rm = TRUE)) %>%
          mutate(CIlow =  AvgBeta - (qnorm(0.025, lower.tail = FALSE) * sdBeta * sqrt(K/n)),
              CIup = AvgBeta + (qnorm(0.025, lower.tail = FALSE) * sdBeta * sqrt(K/n)),
              type = 'scaledSD') %>% 
          ungroup() %>% rowwise() %>% 
          mutate(coverage_ind = ifelse(TrueBeta >= CIlow && TrueBeta <= CIup, 1, 0),
              sim = i)
  
  # Collect the values over all simulations
  scaledSD <- bind_rows(scaledSD,sdScale_tmp)
}

scaledSDTable_tmp <- scaledSD %>% group_by(K,nRSloops) %>% summarise(EC = mean(coverage_ind))
scaledSDTable <- matrix(scaledSDTable_tmp$EC, ncol = length(nRSloops_vec), byrow = TRUE)
colnames(scaledSDTable) <- paste('B = ', nRSloops_vec, sep = '')
rownames(scaledSDTable) <- paste('K = ', K_vec, sep = '')
knitr::kable(scaledSDTable)
B = 10 B = 120 B = 230 B = 340 B = 450 B = 560 B = 670 B = 780 B = 890 B = 1000
K = 10 0.021 0.079 0.059 0.057 0.048 0.044 0.028 0.033 0.025 0.021
K = 120 0.124 0.320 0.440 0.505 0.561 0.584 0.595 0.613 0.616 0.622
K = 230 0.152 0.447 0.565 0.647 0.676 0.737 0.744 0.776 0.785 0.803
K = 340 0.164 0.552 0.652 0.712 0.753 0.785 0.789 0.796 0.817 0.825
K = 450 0.189 0.567 0.703 0.755 0.788 0.808 0.829 0.829 0.846 0.854
K = 560 0.241 0.621 0.748 0.809 0.817 0.849 0.857 0.865 0.871 0.879
K = 670 0.235 0.652 0.762 0.794 0.823 0.842 0.860 0.864 0.882 0.873
K = 780 0.257 0.706 0.790 0.833 0.871 0.876 0.885 0.888 0.900 0.903
K = 890 0.263 0.716 0.826 0.845 0.876 0.889 0.898 0.907 0.916 0.920
K = 1000 0.313 0.737 0.830 0.851 0.875 0.891 0.901 0.898 0.908 0.917
Sandwich variance estimator CI
S3_colnames <- c("beta.X_smartph_hrs", "beta.X_sex", 
                 "beta.X_econArea",  "beta.X_ethnic", 
                 "sVariance.X_smartph_hrs", "sVariance.X_sex",
                 "sVariance.X_econArea", 
                 "sVariance.X_ethnic",  "K", "nRSloops", "TrueBeta")

SummedSvar <- data.frame() %>% tbl_df()
# Read in data
for(i in 1:nsim){
  ValSim <- read.table(file = paste0(datLoc, 'uni_beta_vector_simID_', i, '_SCEN_', SCEN, '.txt'),
         col.names = S3_colnames, header = TRUE) %>% 
        tbl_df()
  ValSim <- rename(ValSim, beta = beta.X_smartph_hrs, sVariance = sVariance.X_smartph_hrs)
  # CI through now summing sandwich variance of beta
  SummedSvar_tmp <- ValSim %>% group_by(K, nRSloops, TrueBeta) %>% 
          summarise(AvgBeta = mean(beta, na.rm = TRUE),  
                    SummedSvar = sum(sVariance, na.rm = TRUE)) %>%
          mutate(bVar = SummedSvar * (1/nRSloops^2),
              CIlow =  AvgBeta - (qnorm(0.025, lower.tail = FALSE) *
                                       sqrt(bVar)),
              CIup = AvgBeta + (qnorm(0.025, lower.tail = FALSE) * 
                                  sqrt(bVar)),
              type = 'avSVar',  sim = i,
              coverage_ind = 
                ifelse(TrueBeta >= CIlow && TrueBeta <= CIup, 1, 0))
  
  # Collect the values over all simulations
  SummedSvar <- bind_rows(SummedSvar,SummedSvar_tmp)
}

SummedSvartable_tmp <- SummedSvar %>% group_by(K,nRSloops) %>% summarise(EC = mean(coverage_ind))
SummedSvartable <- matrix(SummedSvartable_tmp$EC, ncol = length(nRSloops_vec), byrow = TRUE)
colnames(SummedSvartable) <- paste('B = ', nRSloops_vec, sep = '')
rownames(SummedSvartable) <- paste('K = ', K_vec, sep = '')
knitr::kable(SummedSvartable)
B = 10 B = 120 B = 230 B = 340 B = 450 B = 560 B = 670 B = 780 B = 890 B = 1000
K = 10 0.638 NA NA NA NA NA NA NA NA NA
K = 120 0.945 0.930 0.908 0.896 0.873 0.853 0.836 0.824 0.802 0.789
K = 230 0.945 0.932 0.915 0.896 0.888 0.873 0.871 0.845 0.824 0.811
K = 340 0.948 0.939 0.911 0.877 0.850 0.846 0.815 0.781 0.773 0.763
K = 450 0.942 0.917 0.876 0.856 0.821 0.804 0.781 0.754 0.738 0.718
K = 560 0.952 0.917 0.893 0.852 0.815 0.799 0.772 0.740 0.719 0.714
K = 670 0.951 0.907 0.864 0.815 0.779 0.754 0.726 0.706 0.682 0.667
K = 780 0.946 0.911 0.854 0.823 0.781 0.770 0.735 0.700 0.673 0.644
K = 890 0.937 0.919 0.871 0.808 0.768 0.725 0.686 0.668 0.635 0.614
K = 1000 0.942 0.889 0.851 0.783 0.737 0.709 0.689 0.660 0.630 0.607
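The summed-sandwich construction above corresponds to treating the \(B\) subsample fits as (approximately) independent. Writing the averaged estimator and its variance estimate as

\[ \hat{\beta} = \frac{1}{B}\sum_{b=1}^{B}\hat{\beta}_b, \qquad \widehat{\text{Var}}(\hat{\beta}) = \frac{1}{B^2}\sum_{b=1}^{B}\hat{\sigma}^2_b, \]

with \(\hat{\sigma}^2_b\) the sandwich variance of subsample \(b\), matches bVar = SummedSvar * (1/nRSloops^2) in the code; note this ignores dependence between overlapping subsamples.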

Model 4

Parameters:

u <- 10
alpha <- 1
sigma <- 5
trueBeta <- alpha/(sqrt(2) * sigma)
# Model (called SCEN in script)
SCEN <- 4

Load in data

Start with estimated \(\beta\):

# load in estimated beta values
for(i in 1:nsim){
  if(i %in% progress) print(paste0("At ", i/nsim*100, "%"))
  # Values in one simulation: estimated beta values averaged within the sampling algorithm
  ValSim <-  try(read.table(file = paste0(datLoc, 'uni_beta_vector_simID_', i, '_SCEN_', SCEN, '.txt'),
               col.names = c("beta", "sVariance", "K", "nRSloops", "TrueBeta"), 
               header = TRUE) %>% tbl_df() %>%
      group_by(K, nRSloops, TrueBeta) %>% summarise(avBeta = mean(beta)) %>% 
      mutate(sim = i), silent = TRUE)
  if(inherits(ValSim, "try-error")){
    print(i); next
  }
  # Add to data frame with all simulations
  valuesAllSim <- bind_rows(valuesAllSim, ValSim)
}

Same for computational time:

# load in computational time
for(i in 1:nsim){
  if(i %in% progress) print(paste0("At ", i/nsim*100, "%"))
  # Values in one simulation: computational time 
  TimeSim <-  try(read.table(file = paste0(datLoc, 'uni_time_vector_simID_', i, '_SCEN_', SCEN, '.txt'),
              col.names = c("minutes"), header = TRUE) %>% tbl_df() %>%
              bind_cols(., pairs) %>% 
              mutate(sim = i), silent = TRUE)
  if(inherits(TimeSim, "try-error")){
    print(i); next
  }
  # Add to data frame with all simulations
  timeAllSim <- bind_rows(timeAllSim, TimeSim)
}

Bias vs computational time

Quick summary:

tbl_df(valuesAllSim)
# A tibble: 100,000 × 5
       K nRSloops  TrueBeta    avBeta   sim
   <int>    <int>     <dbl>     <dbl> <int>
1     10       10 0.1414214 0.1332534     1
2     10      120 0.1414214 0.1598918     1
3     10      230 0.1414214 0.1691766     1
4     10      340 0.1414214 0.1729972     1
5     10      450 0.1414214 0.1733743     1
6     10      560 0.1414214 0.1748912     1
7     10      670 0.1414214 0.1727332     1
8     10      780 0.1414214 0.1755666     1
9     10      890 0.1414214 0.1776100     1
10    10     1000 0.1414214 0.1783243     1
# ... with 99,990 more rows
group_by(valuesAllSim, K, nRSloops) %>% summarise(avBeta = mean(avBeta)) %>% summary()
       K           nRSloops        avBeta      
 Min.   :  10   Min.   :  10   Min.   :0.1415  
 1st Qu.: 230   1st Qu.: 230   1st Qu.:0.1417  
 Median : 505   Median : 505   Median :0.1417  
 Mean   : 505   Mean   : 505   Mean   :0.1450  
 3rd Qu.: 780   3rd Qu.: 780   3rd Qu.:0.1422  
 Max.   :1000   Max.   :1000   Max.   :0.1742  

Plotting the estimated \(\beta\) parameters for each simulation, faceted by the number of iterations (B).

MSEtable_tmp <- group_by(valuesAllSim, K, nRSloops) %>% mutate(SE = (avBeta - TrueBeta)^2) %>%
  summarise(MSE = mean(SE, na.rm = TRUE))
MSEtable <- matrix(MSEtable_tmp$MSE, ncol = length(nRSloops_vec), byrow = TRUE)
colnames(MSEtable) <- paste('B = ', nRSloops_vec, sep = '')
rownames(MSEtable) <- paste('K = ', K_vec, sep = '')
options( scipen = 6 )
knitr::kable(MSEtable)
B = 10 B = 120 B = 230 B = 340 B = 450 B = 560 B = 670 B = 780 B = 890 B = 1000
K = 10 0.0028661 0.0011504 0.0010975 0.0010447 0.0010452 0.0010334 0.0010264 0.0010227 0.0010144 0.0010101
K = 120 0.0000726 0.0000078 0.0000057 0.0000046 0.0000044 0.0000040 0.0000038 0.0000037 0.0000035 0.0000034
K = 230 0.0000344 0.0000038 0.0000025 0.0000020 0.0000018 0.0000016 0.0000015 0.0000014 0.0000014 0.0000013
K = 340 0.0000230 0.0000025 0.0000016 0.0000013 0.0000011 0.0000010 0.0000010 0.0000009 0.0000009 0.0000008
K = 450 0.0000165 0.0000017 0.0000011 0.0000009 0.0000008 0.0000008 0.0000007 0.0000007 0.0000006 0.0000006
K = 560 0.0000137 0.0000017 0.0000010 0.0000009 0.0000007 0.0000007 0.0000006 0.0000006 0.0000006 0.0000005
K = 670 0.0000115 0.0000013 0.0000008 0.0000007 0.0000006 0.0000006 0.0000006 0.0000005 0.0000005 0.0000005
K = 780 0.0000097 0.0000011 0.0000008 0.0000007 0.0000006 0.0000006 0.0000005 0.0000005 0.0000005 0.0000005
K = 890 0.0000092 0.0000010 0.0000007 0.0000006 0.0000006 0.0000005 0.0000005 0.0000005 0.0000005 0.0000004
K = 1000 0.0000080 0.0000010 0.0000007 0.0000006 0.0000005 0.0000005 0.0000005 0.0000004 0.0000004 0.0000004

Plotting the computational time.

Distribution of estimator

for(j in 1:length(nRSloops_vec)){
  K_temp <- K_vec[j]
  B_temp <- nRSloops_vec[j]
  assign(paste0("Plot", j),
    valuesAllSim %>% filter(K == K_temp & nRSloops == B_temp) %>% 
    select(avBeta) %>% unlist(.) %>% 
    as.numeric() %>% gg_qq(x = ., title = paste0('K = ', K_temp, ' and B = ', B_temp))
  )
}

cowplot::plot_grid(Plot1, Plot2, Plot3, Plot4, Plot5, Plot6, labels = c("A", "B", "C", "D", "E", "F"), ncol = 3)
cowplot::plot_grid(Plot7, Plot8, Plot9, Plot10, labels = c("G", "H", "I", "J"), ncol = 3)

Empirical CI coverage

Bootstrap m-out-of-n

Again, scale the CI according to bootstrap m-out-of-n theory.

scaledSD <- data.frame() %>% tbl_df()
# Read in data (again, sigh)
for(i in 1:nsim){
  ValSim <- read.table(file = paste0(datLoc, 'uni_beta_vector_simID_', i, '_SCEN_', SCEN, '.txt'),
         col.names = c("beta", "sVariance", "K", "nRSloops", "TrueBeta"), 
         header = TRUE) %>% 
          tbl_df()
  # CI using SD of beta with scaling
  sdScale_tmp <- ValSim %>% group_by(K, nRSloops, TrueBeta) %>% 
          summarise(AvgBeta = mean(beta, na.rm = TRUE),  
                    sdBeta = sd(beta, na.rm = TRUE)) %>%
          mutate(CIlow =  AvgBeta - (qnorm(0.025, lower.tail = FALSE) * sdBeta * sqrt(K/n)),
              CIup = AvgBeta + (qnorm(0.025, lower.tail = FALSE) * sdBeta * sqrt(K/n)),
              type = 'scaledSD') %>% 
          ungroup() %>% rowwise() %>% 
          mutate(coverage_ind = ifelse(TrueBeta >= CIlow && TrueBeta <= CIup, 1, 0),
              sim = i)
  # Collect the values over all simulations
  scaledSD <- bind_rows(scaledSD,sdScale_tmp)
}

scaledSDTable_tmp <- scaledSD %>% group_by(K,nRSloops) %>% summarise(EC = mean(coverage_ind))
scaledSDTable <- matrix(scaledSDTable_tmp$EC, ncol = length(nRSloops_vec), byrow = TRUE)
colnames(scaledSDTable) <- paste('B = ', nRSloops_vec, sep = '')
rownames(scaledSDTable) <- paste('K = ', K_vec, sep = '')
knitr::kable(scaledSDTable)
B = 10 B = 120 B = 230 B = 340 B = 450 B = 560 B = 670 B = 780 B = 890 B = 1000
K = 10 0.024 0.001 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
K = 120 0.104 0.277 0.313 0.352 0.334 0.344 0.323 0.328 0.329 0.328
K = 230 0.143 0.416 0.496 0.520 0.549 0.591 0.593 0.611 0.615 0.629
K = 340 0.153 0.495 0.621 0.653 0.692 0.716 0.726 0.734 0.741 0.745
K = 450 0.196 0.585 0.685 0.733 0.756 0.779 0.794 0.814 0.818 0.813
K = 560 0.204 0.575 0.720 0.743 0.798 0.816 0.835 0.840 0.852 0.858
K = 670 0.245 0.674 0.759 0.804 0.819 0.845 0.861 0.863 0.869 0.879
K = 780 0.252 0.679 0.777 0.806 0.827 0.843 0.866 0.869 0.875 0.874
K = 890 0.282 0.704 0.787 0.826 0.843 0.852 0.878 0.878 0.881 0.884
K = 1000 0.283 0.741 0.819 0.838 0.862 0.872 0.890 0.894 0.895 0.907
Sandwich variance estimator CI
SummedSvar <- data.frame() %>% tbl_df()
# Read in data
for(i in 1:nsim){
  ValSim <- read.table(file = paste0(datLoc, 'uni_beta_vector_simID_', i, '_SCEN_', SCEN, '.txt'),
         col.names = c("beta", "sVariance", "K", "nRSloops", "TrueBeta"), header = TRUE) %>% 
        tbl_df()
  # CI through now summing sandwich variance of beta
  SummedSvar_tmp <- ValSim %>% group_by(K, nRSloops, TrueBeta) %>% 
          summarise(AvgBeta = mean(beta, na.rm = TRUE),  
                    SummedSvar = sum(sVariance, na.rm = TRUE)) %>%
          mutate(bVar = SummedSvar * (1/nRSloops^2),
              CIlow =  AvgBeta - (qnorm(0.025, lower.tail = FALSE) *
                                       sqrt(bVar)),
              CIup = AvgBeta + (qnorm(0.025, lower.tail = FALSE) * 
                                  sqrt(bVar)),
              type = 'avSVar',  sim = i,
              coverage_ind = 
                ifelse(TrueBeta >= CIlow && TrueBeta <= CIup, 1, 0))
  
  # Collect the values over all simulations
  SummedSvar <- bind_rows(SummedSvar,SummedSvar_tmp)
}

SummedSvartable_tmp <- SummedSvar %>% group_by(K,nRSloops) %>% summarise(EC = mean(coverage_ind))
SummedSvartable <- matrix(SummedSvartable_tmp$EC, ncol = length(nRSloops_vec), byrow = TRUE)
colnames(SummedSvartable) <- paste('B = ', nRSloops_vec, sep = '')
rownames(SummedSvartable) <- paste('K = ', K_vec, sep = '')
knitr::kable(SummedSvartable)
B = 10 B = 120 B = 230 B = 340 B = 450 B = 560 B = 670 B = 780 B = 890 B = 1000
K = 10 0.679 0.075 0.008 0.000 0.000 0.000 0.000 0.000 0.000 0.000
K = 120 0.934 0.880 0.817 0.758 0.686 0.640 0.598 0.557 0.527 0.484
K = 230 0.946 0.910 0.860 0.821 0.781 0.749 0.728 0.687 0.659 0.639
K = 340 0.943 0.903 0.877 0.833 0.799 0.768 0.737 0.716 0.703 0.678
K = 450 0.947 0.918 0.882 0.846 0.805 0.776 0.745 0.727 0.706 0.685
K = 560 0.948 0.890 0.850 0.818 0.798 0.769 0.738 0.725 0.689 0.670
K = 670 0.941 0.911 0.862 0.822 0.777 0.743 0.727 0.688 0.672 0.647
K = 780 0.949 0.885 0.854 0.788 0.740 0.722 0.684 0.661 0.646 0.621
K = 890 0.938 0.893 0.840 0.779 0.735 0.723 0.675 0.651 0.639 0.614
K = 1000 0.944 0.886 0.834 0.767 0.716 0.688 0.652 0.632 0.618 0.599